Accelerated data transfer using common prior data segments

ABSTRACT

Accelerating data transfers is described herein. When a second computing system is requested to transfer a file to a first computing system, a data segment is sent to the first computing system instead of the entire file. The data segment is then compared to data stored within a data store on the first computing system. If the data segment and data within the data store match, then the file does not need to be transferred, and a pointer points to the file already located on the first computing system. If the data segment does not match any data stored in the data store, then the file is transferred from the second computing system to the first computing system. By comparing only the data segment instead of sending an entire file, data transfer is able to be greatly expedited in situations where the data is common between systems.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 11/525,729 filed on Sep. 22, 2006, entitled “Accelerated data transfer using common prior data segments,” which issued as U.S. Pat. No. 9,317,506 on Apr. 19, 2016, and which application is expressly incorporated herein by reference in its entirety.

BACKGROUND OF THE INVENTION

1. The Field of the Invention

The present invention relates to the field of data transfers. More specifically, the present invention relates to the field of transferring data using common prior data segments.

2. The Background of the Invention

As networking technologies grow, including the Internet, so do their capabilities and requirements. For many years, users dialed up to access the Internet at speeds of 14.4 kilobits per second (kbps), then 28.8 kbps and 56 kbps. Then ISDN lines made 128 kbps a possibility. Currently cable modems and DSL provide extremely fast connections with high bandwidth to home users. Other technologies such as T1 and T3 lines provide possibly even faster connections and are usually implemented by businesses and universities. As these technologies are increasing in capabilities, so too are the size and amount of the data traveling from one networked device to another. For example, when 14.4 kbps connections were prominent, a file of a few hundred kilobytes was considered quite large and took a while to download. With current broadband technologies utilizing cable modems and DSL, a file of multiple megabytes is able to be downloaded in a few minutes. Hence, technology has improved substantially, enabling larger files to be downloaded in a short amount of time. However, many data files are currently in the range of gigabytes, such as movie files, which could take hours to download even on fast connections and would take days with older dial-up connections. Since the Internet and other networks are being used to couple everything together lately, even toasters and refrigerators, many attempts have been made to make network connections more efficient utilizing data processing techniques.

One technique is to compress the data before sending it over the network. However, that has the drawback of adding the steps of compressing the data before it is sent and uncompressing the data after it is received, which simply adds time to the process in a different way. Furthermore, since many files like .mp3s are already compressed yet still quite large, compressing them again will do little if anything to improve network speed.

Another technique is described in U.S. Patent App. No. 2004/0148306 to Moulton, et al. Moulton describes a hash file system that is based and organized upon hashes and which is able to eliminate redundant copies of aggregate blocks of data or parts of data blocks from the system. The hash file system taught by Moulton utilizes hash values for computer files or file pieces which are produced by a checksum generating program, engine or algorithm. The hash file system as taught by Moulton is able to be used as a network accelerator by sending hashes for the data instead of the data itself.

BRIEF SUMMARY OF THE INVENTION

Accelerating data transfers is described herein. When a second computing system is requested to transfer a file to a first computing system, a data segment is sent to the first computing system instead of the entire file. The data segment is then compared to data stored within a data store on the first computing system. If the data segment and data within the data store match, then the file does not need to be transferred, and a pointer points to the file already located on the first computing system. If the data segment does not match any data stored in the data store, then the file is transferred from the second computing system to the first computing system. By comparing only the data segment instead of sending an entire file, data transfer is able to be greatly expedited in situations where the data is common between systems.

In one aspect, a method of accelerating data transfer comprises storing data in a data store on a first computing system wherein the data corresponds to one or more files stored on the first computing system, transferring a data segment from a source file from a second computing system to the first computing system over a network, scanning the data store for the data segment, generating one or more pointers to the one or more corresponding files of one or more matching data segments, if the one or more matching data segments are identified in the data store and transferring a copy of the source file, if the one or more matching data segments are not identified in the data store. The data store is a database. The first computing system is a target system and the second computing system is a source system. The first computing system is a server and the second computing system is a client system. The client system is selected from the group consisting of a personal computer, a PDA, a cell phone, a laptop, a thin client, a Mac computer, an mp3 player and a gaming console. Alternatively, the first computing system is a first client system and the second computing system is a second client system. The data segment is one or more cyclic redundancy checks and the data in the data store includes cyclic redundancy checks and the data segment and the data are compared. Alternatively, the data segment is a unique database key and the data in the data store includes database keys and the data segment and the data are compared. Alternatively, the data segment is a hash and the data in the data store includes hashes and the data segment and the data are compared. The data store grows as more files are stored on the first computing system. The files stored on the first computing system are minimized by implementing the data store. One or more additional computing systems are coupled to the first computing system. 
The method further comprises transferring only a first section of the source file when only a second section of the source file is found within the data store. A standard operating system and file system are utilized on the first computing system and the second computing system.

In another aspect, a system for accelerating data transfer comprises a first computing system for storing one or more files and a data store for storing data corresponding to the one or more files and a second computing system coupled to the first computing system, wherein a data segment is compared to the data within the data store on the first computing system after being received from the second computing system, further wherein a pointer to the one or more files is added on the first computing system if the data segment is found within the data store, but a copy of a source file corresponding to the data segment is transferred from the second computing system to the first computing system if the data segment is not found in the data store. The data store is a database. The first computing system is a target system and the second computing system is a source system. The first computing system is a server and the second computing system is a client system. The client system is selected from the group consisting of a personal computer, a PDA, a cell phone, a laptop, a thin client, a Mac computer, an mp3 player and a gaming console. Alternatively, the first computing system is a first client system and the second computing system is a second client system. The data segment is one or more cyclic redundancy checks and the data in the data store includes cyclic redundancy checks and the data segment and the data are compared. Alternatively, the data segment is a unique database key and the data in the data store includes database keys and the data segment and the data are compared. Alternatively, the data segment is a hash and the data in the data store includes hashes and the data segment and the data are compared. The data store grows as more files are stored on the first computing system. The files stored on the first computing system are minimized by implementing the data store. 
The system further comprises one or more additional computing systems coupled to the first computing system. Only a first section of the source file is transferred when only a second section of the source file is found within the data store. A standard operating system and file system are utilized on the first computing system and the second computing system. The system further comprises a network coupling the first computing system and the second computing system.

In another aspect, a network of systems for accelerating data transfers comprises one or more source systems for transferring a data segment corresponding to a source file stored on the one or more source systems, one or more target systems coupled to the one or more source systems for storing data in a data store corresponding to one or more files and for comparing the data segment received from the one or more source systems with the data in the data store where if the data segment is found, a pointer is generated to point to a corresponding file in the one or more files on the target system instead of transferring the source file over a network. The data store is a database. The one or more target systems are one or more servers and the one or more source systems are one or more client systems. The one or more client systems are selected from the group consisting of personal computers, PDAs, cell phones, laptops, thin clients, Mac computers, mp3 players and gaming consoles. The data segment is one or more cyclic redundancy checks and the data in the data store includes cyclic redundancy checks and the data segment and the data are compared. Alternatively, the data segment is a unique database key and the data in the data store includes database keys and the data segment and the data are compared. Alternatively, the data segment is a hash and the data in the data store includes hashes and the data segment and the data are compared. The data store grows as more files are stored on the one or more target systems. The files stored on the one or more target systems are minimized by implementing the data store. Only a first section of the source file is transferred when only a second section of the source file is found within the data store. A standard operating system and file system are utilized on the one or more target systems and the one or more source systems.

In yet another aspect, a storage system configured to receive data from a plurality of computing systems comprises one or more files, a set of information corresponding to the one or more files and a data store for storing the set of information, wherein a data segment received from a source system is compared with the set of information stored within the data store and a pointer is generated to point to a corresponding file in the one or more files if the data segment is found but if the data segment is not found within the data store, a copy of a source file corresponding to the data segment is transferred. The data store is a database. The data segment is one or more cyclic redundancy checks and the set of information in the data store includes cyclic redundancy checks and the data segment and the set of information are compared. Alternatively, the data segment is a unique database key and the set of information in the data store includes database keys and the data segment and the set of information are compared. Alternatively, the data segment is a hash and the set of information in the data store includes hashes and the data segment and the set of information are compared. The data store grows as more files are stored on the storage system. The files stored on the storage system are minimized by implementing the data store. Only a first section of the source file is transferred when only a second section of the source file is found within the data store. A standard operating system and file system are utilized on the storage system.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates a graphical representation of a configuration of an embodiment of the present invention.

FIG. 2 illustrates an exemplary graphical representation of files and directories stored in a target system in an embodiment of the present invention.

FIG. 3 illustrates a graphical representation of a network of systems configured in the present invention.

FIG. 4 illustrates a flowchart of an embodiment of the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

A system for and method of accelerating data transfers over a network is described herein. In the past, data was transferred with a minimal check to determine if the data is already located on the destination system. Essentially, a check was made whether a file with the same file name is located in the location of the desired destination. For example, if a user was copying movie.avi from his personal computer to a folder entitled “comedy” on a backup server used for people to store their movies, the server only checks if movie.avi exists in the “comedy” folder. However, there are a number of problems with this. The main one is that the file movie.avi could already be located on the server just in a different folder. It would be a waste of network resources to copy the entire movie.avi file, considering a typical movie file is a few hundred megabytes or possibly gigabytes. Using the present invention, only a data segment is sent from the user's computer to the server, and then the server searches its system and locates the preexisting movie.avi and simply generates a pointer to it. Thus, only a very small amount of data is sent over the network instead of a huge movie file, and each file is only stored a single time on the storage system.

FIG. 1 illustrates a graphical representation of a configuration of an embodiment of the present invention. A source computing system 100 is coupled to a target computing system 120 through a network 110. Both the source computing system 100 and the target computing system 120 are able to be any computing system with the ability to transfer data to another system. Such computing systems include but are not limited to, personal computers, laptops, servers, thin clients, cell phones, PDAs, Mac computers, mp3 players and gaming consoles. The network 110 is able to be a Local Area Network (LAN), Wide Area Network (WAN), Metropolitan Area Network (MAN), the Internet, or any other type of network. Although the configuration in FIG. 1 shows the two systems coupled through the network 110, it is possible for the source computing system 100 and the target computing system 120 to be directly coupled to each other. Within the source computing system 100 are standard computing elements including a hard drive 102 where files 104 are stored on a source file system 106. In some embodiments, the hard drive 102 is not a standard hard disk drive but another type of storage system including, but not limited to, a compact disc, a DVD, an optical drive, a network drive or a Redundant Array of Inexpensive Disks (RAID). When a user desires to transfer a file, a source file 104′ is selected by the user. For example, a file named, movie.avi, is selected to be transferred. However, unlike past implementations of transferring data, the process does not begin by transferring the entire source file 104′.

After the source file 104′ is selected to be transferred, a data segment 112 of the source file 104′ is sent across the network 110. In an embodiment, the data segment 112 is a section of the source file 104′. Using the movie.avi example, only a section of the file is sent over the network. In other embodiments, the data segment 112 is a different representation of the data such as a hash or a sliding Cyclic Redundancy Check (CRC) of the source file 104′. In other embodiments, other similar implementations are used where a representation of the source file 104′ is sent over the network 110 instead of the entire file. Additionally, representations of parts of the source file 104′ are able to be sent.
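As a non-limiting illustration of the representations mentioned above, the following Python sketch derives three possible data segments from the contents of a source file: a leading section of the file, a hash, and a CRC. The function names, the 4096-byte section size, and the choice of SHA-256 and CRC-32 are assumptions made for illustration only; the description does not name particular algorithms. The CRC shown is computed over the whole content and is not a sliding CRC.

```python
import hashlib
import zlib


def section_segment(content: bytes, size: int = 4096) -> bytes:
    """Use the first `size` bytes of the file as the data segment."""
    return content[:size]


def hash_segment(content: bytes) -> str:
    """Use a SHA-256 digest of the file as the data segment (assumed algorithm)."""
    return hashlib.sha256(content).hexdigest()


def crc_segment(content: bytes) -> int:
    """Use a CRC-32 of the file as the data segment (assumed algorithm)."""
    return zlib.crc32(content) & 0xFFFFFFFF
```

In each case the value sent over the network is far smaller than a multi-gigabyte movie file, which is the property the description relies on.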

The target computing system 120 similarly has standard computing components including a hard drive 122. In some embodiments, the hard drive 122 is not a standard hard disk drive but another type of storage system as described above. Within the hard drive 122 is a standard operating system such as Microsoft® Windows XP and a standard file system 128 such as New Technology File System (NTFS) where one or more files 124 are stored. In alternate embodiments, the file system is a non-standard file system. The file system 128 also contains a data store 126. The file system 128 utilizes typical structures such as directories or folders to store the files 124. The data store 126 is an implementation that is able to store data 126′ in an organized manner so that it is searchable. In some embodiments, the data store 126 is a database. The data 126′ stored within the data store 126 corresponds to the files 124 stored in the file system 128. For example, since a movie.avi file 124′ is stored within the hard drive 122, the data store 126 contains data 126′ corresponding to movie.avi. The data 126′ within the data store 126 depends on the embodiment implemented wherein some embodiments store segments of files, hashes, CRCs, unique database keys and/or other similar implementations of data representation.
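One way to picture a data store that is a database is the following minimal Python sketch, which keeps one row per stored file and makes the stored representation searchable. The table name, column names, helper functions and the use of SQLite with SHA-256 digests are all hypothetical choices for illustration, not details taken from the description.

```python
import hashlib
import sqlite3


def open_data_store() -> sqlite3.Connection:
    """Create an in-memory data store mapping a digest to a stored file path."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE data_store (digest TEXT PRIMARY KEY, path TEXT NOT NULL)"
    )
    return conn


def register_file(conn: sqlite3.Connection, path: str, content: bytes) -> str:
    """Record the representation of a file already stored in the file system."""
    digest = hashlib.sha256(content).hexdigest()
    conn.execute(
        "INSERT OR IGNORE INTO data_store (digest, path) VALUES (?, ?)",
        (digest, path),
    )
    return digest


def find_file(conn: sqlite3.Connection, digest: str):
    """Return the path of the file matching the digest, or None if absent."""
    row = conn.execute(
        "SELECT path FROM data_store WHERE digest = ?", (digest,)
    ).fetchone()
    return row[0] if row else None
```

Registering movie.avi and then looking up its digest returns the stored path, while an unknown digest returns None.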

The data segment 112 sent from the source computing system 100 is received by the target computing system 120. The target computing system 120 then searches within the data store 126 for a matching data segment. Continuing with the movie.avi example, a matching section of the movie.avi file is searched for within the data store 126. Since the data store 126 contains the movie.avi data 126′, a match is found. Hence, the system knows that the movie.avi file already exists on the target computing system 120. The target computing system 120 then sends a status 114 or some form of response to the source computing system 100 indicating that the file is already located at the target computing system 120. In the situation where the source file 104′ is already located at the target computing system 120, the source computing system 100 does not need to send any more data, and the target computing system 120 adds a pointer or indicates in some way where the data is located, so that the user copying the data is able to retrieve it later on. If the source file 104′ is not located on the target computing system 120, then the status 114 sent back indicates as much. At that point, a copy 116 of the source file 104′ is sent from the source computing system 100 to the target computing system 120. Once the new file is received on the target computing system 120, it is stored with the rest of the files 124 and a representation is stored within the data store 126, so that in the future when a user wants to copy that same file, the target computing system 120 will know that it is there and is able to expedite the data transfer by not having to actually transfer the entire file.
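The exchange described above can be sketched in Python as follows. This is a simplified in-process model under stated assumptions, not the actual implementation: the Target class, the "FOUND" and "NOT_FOUND" status strings, the use of a SHA-256 digest as the data segment, and the byte counting are all hypothetical.

```python
import hashlib


class Target:
    """Toy model of the target computing system."""

    def __init__(self):
        self.files = {}       # path -> file content actually stored
        self.data_store = {}  # digest -> path of the stored file
        self.pointers = {}    # requested path -> path of the existing file

    def check_segment(self, digest: str, dest_path: str) -> str:
        """Search the data store and return a status for the source."""
        existing = self.data_store.get(digest)
        if existing is None:
            return "NOT_FOUND"
        self.pointers[dest_path] = existing  # point to the file already held
        return "FOUND"

    def receive_copy(self, dest_path: str, content: bytes) -> None:
        """Store a newly transferred file and record its representation."""
        self.files[dest_path] = content
        self.data_store[hashlib.sha256(content).hexdigest()] = dest_path


def transfer(target: Target, dest_path: str, content: bytes) -> int:
    """Send the data segment first; send the file only if needed.

    Returns the number of payload bytes sent by the source.
    """
    digest = hashlib.sha256(content).hexdigest()
    if target.check_segment(digest, dest_path) == "FOUND":
        return len(digest)
    target.receive_copy(dest_path, content)
    return len(digest) + len(content)
```

In this model, the first transfer of a file costs the digest plus the full content, while a later transfer of the same content to a different folder costs only the digest and results in a pointer on the target.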

FIG. 2 illustrates an exemplary graphical representation of files and directories stored in a target system in an embodiment of the present invention. Within the example, two users' directories are shown, Brian and Paul. Within each user's storage area, there are four directories: documents, pictures, music and movies, each for storing files related to their respective category. Files that include just the file name within FIG. 2 signify that they are the only file containing that data on the system. For example, Brian's Documents directory contains resume.doc and report.doc which, as expected, are specific to his personal information, so there are no copies of that information found elsewhere on the system. This also means that when Brian transferred these files to the target computing system from his source computing system, the entire files were copied. However, there are types of data where duplicates are commonly found such as music and movies. These common files are the ones that are able to improve network data transfers by not actually copying the entire file and instead linking to the appropriate file already located on the target computing system. Files with a box around them with an arrow pointing outward such as DMB1.mp3, DMB2.mp3 and DMB3.mp3 within Brian's Music directory indicate that those files are actually pointers or links to another file on the system. Here, in Paul's Music directory, he also has DMB1.mp3, DMB2.mp3 and DMB3.mp3 amongst other music files. Paul copied his files before Brian, so his copying included transferring all of the file contents over the network. However, when Brian initiated his transfer, the system found Paul's copies using the methods described herein and instead of transferring the entire files, generated a pointer to Paul's files which are denoted by a box with an arrow pointing inward since they are being linked to. 
Furthermore, from the user's perspective, the files appear the same, even though there are no actual files with music data stored within Brian's Music directory on the system. The process continues as the users transfer files to the system, and when a file is copied determines whose directory includes the actual data and whose directory includes a pointer to data elsewhere on the system. As shown in FIG. 2, Brian copied Spider-Man.avi and then some time later, Paul did as well. Since Brian made his transfer first, the actual data is stored in his directory.

In some embodiments, the data is not stored in a user's directory, but is stored centrally so that everyone has pointers to the data. This alleviates the issue of one user deleting the file while the other user still wants it to remain. For example, if Paul deletes Crash.avi, since the actual movie content is stored in his directory, Brian's pointer would point to nothing if the file is removed from Paul's directory. Using a central storage system where each user points to the central storage, the actual data would not be deleted, just Paul's link to the data, and Brian's link would remain intact. Another embodiment still stores the files in the individual locations, but also keeps track of who is pointing to the files as well. Therefore, if the user with the actual content deletes it, the file is transferred to another user whose link is pointing to the data. The pointers pointing to the file are reconfigured to point to the data's new location. By transferring the data to another user before the actual data is deleted, this safeguards that the actual data is not lost when other users still want the file.
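The central storage embodiment can be illustrated with the following Python sketch, in which every user holds only a link to centrally stored content. The CentralStore class and its method names are hypothetical, and reclaiming the content once no links remain is an added assumption; the description states only that deleting one user's link leaves the data and the other links intact.

```python
import hashlib


class CentralStore:
    """Toy model: content is held centrally, users hold links to it."""

    def __init__(self):
        self.blobs = {}  # digest -> content held in central storage
        self.links = {}  # (user, filename) -> digest

    def store(self, user: str, filename: str, content: bytes) -> None:
        digest = hashlib.sha256(content).hexdigest()
        self.blobs.setdefault(digest, content)  # keep a single copy
        self.links[(user, filename)] = digest

    def read(self, user: str, filename: str) -> bytes:
        return self.blobs[self.links[(user, filename)]]

    def delete(self, user: str, filename: str) -> None:
        """Remove only this user's link to the content."""
        digest = self.links.pop((user, filename))
        # Assumption: content is reclaimed once no link refers to it.
        if digest not in self.links.values():
            del self.blobs[digest]
```

With this arrangement, Paul deleting Crash.avi removes only his link, and Brian is still able to read the file.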

The above example is not meant to limit the present invention in any way. Although only two users are described, any number of users are able to store data on a system. Furthermore, the number of directories and the directory names are variable as well. The file types are not restricted to those described in the example either; any file types are able to be used. Also, when the files are linked, the filenames do not have to be the same. Comparisons performed by the methods described herein focus on the content of the data, not the filenames. Hence, if a filename is Spider-Man.avi on a target and the source filename is Spiderman.avi, but they have the same content, the system is able to recognize they are the same file. The converse is true as well: two files with the same filename do not necessarily have the same content, so links will not point to the wrong data merely because the filenames match.

By implementing the present invention, not only are data transfers accelerated, but storage requirements are reduced as well. Using the example in FIG. 2, there are three music files and two movie files that would have been contained as two separate copies in conventional systems. Assuming the music files are 5 MB and the movie files are 1 GB, that is over 2 GB of data being stored in duplicate. Furthermore, since data on network systems are typically backed up, 4 GB of space is being wasted. Using the present invention, where a few bytes are used to point to the data, over 4 GB of space is saved. Furthermore, since this example only shows two users with a small number of files, the space savings on a large system with thousands of users could be extremely large.
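The figures in the preceding example can be checked with a short calculation. The sketch below assumes decimal units (1 GB = 1000 MB) and ignores the few bytes used by each pointer; the variable names are illustrative only.

```python
MB = 1
GB = 1000 * MB  # assumption: decimal units

music_files, music_size = 3, 5 * MB
movie_files, movie_size = 2, 1 * GB

# Space taken by the second, redundant copy of each common file.
duplicate_mb = music_files * music_size + movie_files * movie_size

# The redundant copies are stored again when the system is backed up.
with_backup_mb = 2 * duplicate_mb
```

Under these assumptions the duplicated data amounts to 2,015 MB, and 4,030 MB once the backup is counted, consistent with the "over 2 GB" and "over 4 GB" figures above.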

FIG. 3 illustrates a graphical representation of a network of systems configured in the present invention. As described above, the present invention includes one or more source computing systems and one or more target computing systems. The source computing system is where the data to be transferred is located, and the target computing system is where the data will be stored after the transfer. Although FIG. 1 illustrates one source computing system and one target computing system, a network of systems 300 is able to include any number of source and target computing systems. FIG. 3 illustrates the computing systems coupled by a network 302. The computing systems include, but are not limited to, a server 304, a personal computer 306, a PDA 308, a cell phone 310, a laptop 312, a thin client 314, a Mac computer 316, an mp3 player 318 and a gaming console 320. Generally, the target computing systems are servers and the source computing systems are personal computers, PDAs, cell phones, laptops, thin clients, Mac computers, mp3 players and gaming consoles. However, any of the systems are able to be either the source or the target.

As an example, a typical configuration for use at a business includes one or more servers 304 as the target systems where users are able to back up their data. The employees then utilize one or more personal computers 306, PDAs 308, cell phones 310 and laptops 312 as the sources for the data. As data is backed up onto the server 304, the accelerated data transfer described herein is utilized. Fewer servers are required because the inefficiencies of duplicated data are resolved. Furthermore, there is less traffic on the network because transfers are much more efficient. Hence, in this setting it is reasonable to have the server be the target computing system and the other systems be the source computing systems.

It is possible though to have the roles of the systems switched or modified. For example, in a home network, a user is able to couple his cell phone, PDA, gaming system and personal computer together where the personal computer is the target system and his cell phone, PDA and gaming system are the source systems.

FIG. 4 illustrates a flowchart of an embodiment of the present invention. In the step 400, data is stored in a data store on a first computing system also referred to as a target computing system. Additionally, files corresponding to the data stored in the data store are also stored on the first computing system. In the step 402, a data segment is transferred from a second computing system or a source computing system to the first computing system. Generally a user selects a file to be transferred from the second computing system to the first computing system, and the data segment is a part of the file, a hash of the file and/or a CRC of the file. In the step 404, the data segment transferred is compared with the data in the data store. Comparing includes scanning the data store for the data segment and then identifying matching data in the data store. In the step 406, if a match is found then a pointer is generated to point to the corresponding file or files in the step 408. However, if a match is not found in the step 406, then a copy of the source file is transferred from the second computing system to the first computing system.

Although the present invention has been described where a data segment is compared to data, and then a link is generated to point to the entire file corresponding with the data, sections of files are able to be matched as well where the entire file is not the same. For example, sometimes additional data is included at the beginning or end of a music or movie file making the file slightly different from one that has very similar contents. Or, for example, one person has a fifteen second clip of a five minute long video, so the fifteen second clip is contained within the file of the long video. Such sections of data are able to be compared and matched by the present invention using a section of the file or a CRC or hash of a section of the file. In those instances, instead of transferring the entire file across the network because there is some offset or slight difference between the data, the present invention copies the data from the file residing on the target system. The sections of the file that are not already existing on the target system are transferred over the network, and the file is combined to generate the file initially intended to be transferred. In another embodiment, a master file is stored on the target system where the master file contains more data than a smaller file which only contains a portion of the master file. A pointer then points to the correct sections of the master file to represent the smaller file.
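Section-level matching of this kind can be sketched in Python as follows. The example splits a file into fixed-size sections, sends only the sections the target does not already hold, and rebuilds the file on the target from stored and newly received sections. The 4-byte section size, the SHA-256 digests and the function names are assumptions for illustration; fixed-size sections also do not handle data shifted by an arbitrary offset, which a sliding CRC would be better suited to detect.

```python
import hashlib

SECTION_SIZE = 4  # deliberately tiny so the example is easy to follow


def sections(content: bytes, size: int = SECTION_SIZE) -> list:
    """Split the file content into consecutive fixed-size sections."""
    return [content[i:i + size] for i in range(0, len(content), size)]


def digest(section: bytes) -> str:
    return hashlib.sha256(section).hexdigest()


def transfer_partial(known: dict, content: bytes):
    """Transfer only the sections missing from the target.

    `known` maps digest -> section already stored on the target.
    Returns (file rebuilt on the target, number of section bytes sent).
    """
    parts = sections(content)
    digests = [digest(p) for p in parts]
    sent = 0
    for part, d in zip(parts, digests):
        if d not in known:  # section not yet on the target: send it
            known[d] = part
            sent += len(part)
    # The list of digests acts as pointers into the stored sections.
    rebuilt = b"".join(known[d] for d in digests)
    return rebuilt, sent
```

For instance, after a long file has been transferred once, a clip made up entirely of its sections requires no section data to be sent, and a file that adds one new section at the front requires only that section.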

To utilize the present invention a user selects a file or files on a source computing system to be transferred over a network to a target computing system. In some embodiments, a user is not required to initiate the data transfer and the transfer is automated. The target computing system performs the necessary search to determine if any common data is already located on the target computing system. If there is common data, then the file is not transferred or only a portion of the file that is not common is transferred, and a pointer points to the common data. When a user views the data on the target computing system, the appearance is no different whether the file was transferred or is pointed to by a pointer. Furthermore, the present invention is able to be utilized without a specially modified file system.

In operation, users experience accelerated data transfers, but otherwise do not have to modify their ways of transferring data. After a user initiates the data transfer, the target computing system receives a data segment representing the file on the source computing system. The target computing system then compares the data segment with data stored within a data store by scanning the data store for a match. If a match is found, then the source file is not actually transferred over the network, and a pointer is generated on the target computing system. If the target computing system does not locate matching data, then the source file is transferred over the network. By expediting transfers of common data, network efficiency increases greatly in addition to storage requirements being reduced.

The present invention has been described in terms of specific embodiments incorporating details to facilitate the understanding of principles of construction and operation of the invention. Such reference herein to specific embodiments and details thereof is not intended to limit the scope of the claims appended hereto. It will be readily apparent to one skilled in the art that other various modifications may be made in the embodiment chosen for illustration without departing from the spirit and scope of the invention as defined by the claims. 

What is claimed is:
 1. A computing system configured to implement a method of accelerating data transfer, the method including: detecting a selection of files on a source computing system to be transmitted to a target computing system over a network; determining that the selection of files includes at least some common data that is already located at the target computing system; and in response to the determination, refraining from transmitting the common data to the target computing system, but wherein a pointer is generated for the common data.
 2. The computing system of claim 1, wherein the pointer is generated by the source computing system and transmitted to the target computing system.
 3. The computing system of claim 1, wherein the pointer is generated by the target computing system.
 4. The computing system of claim 1, wherein the pointer is stored at the target computing system.
 5. The computing system of claim 1, wherein the common data is only a limited part of a single file.
 6. A method of accelerating data transfer from a source computing system to a target computing system over a network, the method comprising: detecting a selection of files on the source computing system to be transmitted to the target computing system; determining that the selection of files includes at least some common data that is already located at the target computing system; refraining from transmitting the common data to the target computing system; and generating and transmitting a pointer to the common data.
 7. The method of claim 6, wherein the method further includes transmitting uncommon data that includes at least part of the selection of files that is not already located at the target computing system.
 8. The method of claim 7, wherein the pointer is transmitted with the uncommon data.
 9. The method of claim 6, wherein the selection of files is a user-initiated selection of files.
 10. The method of claim 6, wherein the selection of files is automated.
 11. The method of claim 6, wherein the common data is only a limited part of a single file.
 12. A method of accelerating data transfer from a source computing system to a target computing system over a network, the method comprising: detecting a selection of files on the source computing system to be transmitted to the target computing system; determining that the selection of files includes at least some common data that is already located at the target computing system; and refraining from transmitting the common data to the target computing system, wherein a pointer is generated for the common data.
 13. The method of claim 12, wherein the pointer is generated by the target computing system.
 14. The method of claim 12, wherein the pointer is stored at the target computing system.
 15. The method of claim 12, wherein the common data is only a limited part of a single file.